actual performance
Do Code Models Suffer from the Dunning-Kruger Effect?
Singh, Mukul, Chatterjee, Somya, Radhakrishna, Arjun, Gulwani, Sumit
As artificial intelligence systems increasingly collaborate with humans in creative and technical domains, questions arise about the cognitive boundaries and biases that shape our shared agency. This paper investigates the Dunning-Kruger Effect (DKE), the tendency for those with limited competence to overestimate their abilities, in state-of-the-art LLMs on coding tasks. By analyzing model confidence and performance across a diverse set of programming languages, we reveal that AI models mirror human patterns of overconfidence, especially in unfamiliar or low-resource domains. Our experiments demonstrate that less competent models and those operating in rare programming languages exhibit stronger DKE-like bias, suggesting that the strength of the bias is inversely related to the competence of the models.
When Is Prior Knowledge Helpful? Exploring the Evaluation and Selection of Unsupervised Pretext Tasks from a Neuro-Symbolic Perspective
Jia, Lin-Han, Han, Si-Yu, Hu, Wen-Chao, Shao, Jie-Jing, Wei, Wen-Da, Zhou, Zhi, Guo, Lan-Zhe, Li, Yu-Feng
Neuro-symbolic (Nesy) learning improves the target task performance of models by enabling them to satisfy knowledge, while semi/self-supervised learning (SSL) improves the target task performance by designing unsupervised pretext tasks for unlabeled data to make models satisfy corresponding assumptions. We extend the Nesy theory based on reliable knowledge to the scenario of unreliable knowledge (i.e., assumptions), thereby unifying the theoretical frameworks of SSL and Nesy. Through rigorous theoretical analysis, we demonstrate that the impact of pretext tasks on target performance hinges on three factors: knowledge learnability with respect to the model, knowledge reliability with respect to the data, and knowledge completeness with respect to the target. We further propose schemes to operationalize these theoretical metrics, and thereby develop a method that can predict the effectiveness of pretext tasks in advance. This will change the current status quo in practical applications, where the selection of unsupervised tasks is heuristic-based rather than theory-based, and it is difficult to evaluate the rationality of unsupervised pretext task selection before testing the model on the target task. In experiments, we verify a high correlation between the predicted performance (estimated using minimal data) and the actual performance achieved after large-scale semi-supervised or self-supervised learning, thus confirming the validity of the theory and the effectiveness of the evaluation method.
Interpretable Droplet Digital PCR Assay for Trustworthy Molecular Diagnostics
Wei, Yuanyuan, Wu, Yucheng, Qu, Fuyang, Mu, Yao, Ho, Yi-Ping, Ho, Ho-Pui, Yuan, Wu, Xu, Mingkun
Accurate molecular quantification is essential for advancing research and diagnostics in fields such as infectious diseases, cancer biology, and genetic disorders. Droplet digital PCR (ddPCR) has emerged as a gold standard for achieving absolute quantification. While computational ddPCR technologies have advanced significantly, achieving automatic interpretation and consistent adaptability across diverse operational environments remains a challenge. To address these limitations, we introduce the intelligent interpretable droplet digital PCR (I2ddPCR) assay, a comprehensive framework integrating front-end predictive models (for droplet segmentation and classification) with the GPT-4o multimodal large language model (MLLM, for context-aware explanations and recommendations) to automate and enhance ddPCR image analysis. This approach surpasses the state-of-the-art models, affording 99.05% accuracy in processing complex ddPCR images containing over 300 droplets per image with varying signal-to-noise ratios (SNRs). By combining specialized neural networks and large language models, the I2ddPCR assay offers a robust and adaptable solution for absolute molecular quantification, achieving a sensitivity capable of detecting low-abundance targets as low as 90.32 copies/µL. Furthermore, it improves the model's transparency through detailed explanations and troubleshooting guidance, empowering users to make informed decisions. This innovative framework has the potential to benefit molecular diagnostics, disease research, and clinical applications, especially in resource-constrained settings.
Performance Prediction for Multi-hop Questions
Samadi, Mohammadreza, Rafiei, Davood
We study the problem of Query Performance Prediction (QPP) for open-domain multi-hop Question Answering (QA), where the task is to estimate the difficulty of evaluating a multi-hop question over a corpus. Despite the extensive research on predicting the performance of ad-hoc and QA retrieval models, there has been a lack of study on the estimation of the difficulty of multi-hop questions. The problem is challenging due to the multi-step nature of the retrieval process, potential dependency of the steps and the reasoning involved. To tackle this challenge, we propose multHP, a novel pre-retrieval method for predicting the performance of open-domain multi-hop questions. Our extensive evaluation on the largest multi-hop QA dataset using several modern QA systems shows that the proposed model is a strong predictor of the performance, outperforming traditional single-hop QPP models. Additionally, we demonstrate that our approach can be effectively used to optimize the parameters of QA systems, such as the number of documents to be retrieved, resulting in improved overall retrieval performance.
From Contextual Data to Newsvendor Decisions: On the Actual Performance of Data-Driven Algorithms
Besbes, Omar, Ma, Will, Mouchtaki, Omar
In this work, we explore a framework for contextual decision-making to study how the relevance and quantity of past data affect the performance of a data-driven policy. We analyze a contextual Newsvendor problem in which a decision-maker needs to trade off an underage cost against an overage cost in the face of uncertain demand. We consider a setting in which past demands observed under "close by" contexts come from close by distributions and analyze the performance of data-driven algorithms through a notion of context-dependent worst-case expected regret. We analyze the broad class of Weighted Empirical Risk Minimization (WERM) policies which weigh past data according to their similarity in the contextual space. This class includes classical policies such as ERM, k-Nearest Neighbors and kernel-based policies. Our main methodological contribution is to characterize exactly the worst-case regret of any WERM policy on any given configuration of contexts. To the best of our knowledge, this provides the first understanding of tight performance guarantees in any contextual decision-making problem, with past literature focusing on upper bounds via concentration inequalities. We instead take an optimization approach, and isolate a structure in the Newsvendor loss function that allows us to reduce the infinite-dimensional optimization problem over worst-case distributions to a simple line search. This in turn allows us to unveil fundamental insights that were obfuscated by previous general-purpose bounds. We characterize actual guaranteed performance as a function of the contexts, as well as granular insights on the learning curve of algorithms.
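To make the WERM idea concrete, here is a minimal sketch of a weighted Newsvendor decision. It relies on the standard fact that the quantity minimizing a weighted sum of underage and overage costs is the weighted quantile of past demands at the critical ratio b / (b + h). The function names, the one-dimensional contexts, and the sample data are illustrative assumptions, not taken from the paper, and the sketch does not reproduce the paper's regret analysis.

```python
def werm_order_quantity(demands, weights, underage_cost, overage_cost):
    """Weighted ERM for the Newsvendor loss (hypothetical helper).

    Minimizes sum_i w_i * (b * max(d_i - q, 0) + h * max(q - d_i, 0)) over q.
    The minimizer is the weighted quantile at the critical ratio b / (b + h).
    """
    if not demands or len(demands) != len(weights):
        raise ValueError("demands and weights must be non-empty and equal in length")
    if underage_cost <= 0 or overage_cost <= 0:
        raise ValueError("costs must be positive")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    ratio = underage_cost / (underage_cost + overage_cost)
    cumulative = 0.0
    # Walk the demands in increasing order until the weighted CDF reaches the ratio.
    for demand, weight in sorted(zip(demands, weights)):
        cumulative += weight
        if cumulative >= ratio * total:
            return demand
    return max(demands)


def knn_weights(contexts, query, k):
    """Weight 1 for the k past contexts closest to the query, 0 otherwise."""
    order = sorted(range(len(contexts)), key=lambda i: abs(contexts[i] - query))
    chosen = set(order[:k])
    return [1.0 if i in chosen else 0.0 for i in range(len(contexts))]


# Illustrative data (assumed): scalar contexts and the demand observed under each.
past_contexts = [0.1, 0.2, 0.5, 0.9, 1.0]
past_demands = [10, 12, 20, 30, 32]

# k-NN policy: only the 3 nearest contexts to the new context 0.15 get weight.
knn_q = werm_order_quantity(past_demands, knn_weights(past_contexts, 0.15, 3), 3.0, 1.0)
# Plain ERM: every past observation gets equal weight, regardless of context.
erm_q = werm_order_quantity(past_demands, [1.0] * len(past_demands), 3.0, 1.0)

print(knn_q)  # 20
print(erm_q)  # 30
```

With these assumed numbers, plain ERM orders more than the k-NN policy because it also counts the high demands seen under distant contexts.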
Dynamic Ensemble of Low-fidelity Experts: Mitigating NAS "Cold-Start"
Zhao, Junbo, Ning, Xuefei, Liu, Enshu, Ru, Binxin, Zhou, Zixuan, Zhao, Tianchen, Chen, Chen, Zhang, Jiajin, Liao, Qingmin, Wang, Yu
Predictor-based Neural Architecture Search (NAS) employs an architecture performance predictor to improve the sample efficiency. However, predictor-based NAS suffers from the severe "cold-start" problem, since a large amount of architecture-performance data is required to get a working predictor. In this paper, we focus on exploiting information in cheaper-to-obtain performance estimations (i.e., low-fidelity information) to mitigate the large data requirements of predictor training. Despite the intuitiveness of this idea, we observe that using inappropriate low-fidelity information even damages the prediction ability and different search spaces have different preferences for low-fidelity information types. To solve the problem and better fuse beneficial information provided by different types of low-fidelity information, we propose a novel dynamic ensemble predictor framework that comprises two steps. In the first step, we train different sub-predictors on different types of available low-fidelity information to extract beneficial knowledge as low-fidelity experts. In the second step, we learn a gating network to dynamically output a set of weighting coefficients conditioned on each input neural architecture, which will be used to combine the predictions of different low-fidelity experts in a weighted sum. The overall predictor is optimized on a small set of actual architecture-performance data to fuse the knowledge from different low-fidelity experts to make the final prediction. We conduct extensive experiments across five search spaces with different architecture encoders under various experimental settings. Our method can easily be incorporated into existing predictor-based NAS frameworks to discover better architectures.
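The second step, combining expert predictions in a weighted sum with per-architecture coefficients, can be sketched as follows. This is only an illustration: the abstract does not say how the gating network normalizes its outputs, so the softmax here is an assumption, and the function names and the numbers are hypothetical. In the described framework the gate scores would come from a learned network conditioned on the input architecture; here they are passed in directly.

```python
import math


def softmax(scores):
    """Turn raw gate scores into non-negative weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(expert_predictions, gate_scores):
    """Weighted sum of expert predictions for a single architecture.

    expert_predictions: one predicted score per low-fidelity expert.
    gate_scores: one raw score per expert (assumed to come from a gating network).
    """
    if len(expert_predictions) != len(gate_scores):
        raise ValueError("need exactly one gate score per expert")
    weights = softmax(gate_scores)
    return sum(w * p for w, p in zip(weights, expert_predictions))


# Hypothetical example: two experts; equal gate scores reduce to a plain average,
# while a strongly favored expert dominates the combined prediction.
balanced = ensemble_predict([0.6, 0.8], [0.0, 0.0])
skewed = ensemble_predict([0.6, 0.8], [0.0, 5.0])
print(balanced, skewed)
```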
How to test ML models in the real world
How often do you test ML models in a Jupyter notebook, get good results, but still cannot convince your boss that the model should be used right away? Or maybe you manage to convince her and put the model in production, but you do not see any impact on business metrics? Luckily for you, there are better ways to test ML models in the real world and to convince everyone (including you) that they add value to the business. In this article you will learn what these evaluation methods are, how to implement them, and when you should use each. We, data scientists and ML engineers, develop and test ML models in our local development environment, for example, a Jupyter notebook.
Why the high accuracy in classification is not always correct?
Classification accuracy is a statistic that describes a classification model's performance by dividing the number of correct predictions by the total number of predictions. It is simple to compute and comprehend, making it the most often used statistic for assessing classifier models. But the accuracy score is not the best metric for evaluating a model in every scenario. In this article, we will discuss the reasons not to rely completely on accuracy as a performance measure. Following are the topics to be covered.
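The definition above (correct predictions divided by total predictions) can be sketched in a few lines. The imbalanced labels and the comparison with recall are illustrative assumptions added here to show one way a high accuracy score can mislead; they are not taken from the article.

```python
def accuracy(y_true, y_pred):
    """Number of correct predictions divided by the total number of predictions."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of the actual positive examples that the model labeled positive."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_positives:
        raise ValueError("y_true contains no positive examples")
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)


# Hypothetical imbalanced data: 95 negatives and 5 positives.
y_true = [0] * 95 + [1] * 5
# A trivial "model" that always predicts the majority class.
y_pred = [0] * 100

print(accuracy(y_true, y_pred))  # 0.95
print(recall(y_true, y_pred))    # 0.0
```

The always-negative predictor scores 95% accuracy while finding none of the positive cases, which is the kind of situation where accuracy alone is not enough.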
AI in Medicine -- Prospective versus Retrospective
Around the time Sedol Lee was defeated by AlphaGo three or four years ago, there was a widespread sense that artificial intelligence would replace experts in medicine and everything else in the world. The achievements of AI in the medical field were recorded one by one in an IEEE Spectrum article ("AI versus Doctor"; https://ieeexplore.ieee.org/document/8048826). However, over the past year or two, the main focus has moved to the role of artificial intelligence as an assistance tool for experts, and recently it is not uncommon to hear that artificial intelligence is not making a profit in business. Even IBM's Watson was sold amid some criticism. There may be a problem somewhere, so why are we hearing this news?